Multipath remote direct memory access (RDMA) is a new networking technology that provides
an order-of-magnitude increase in bandwidth through a full-mesh backplane and through a
standard Ethernet switch fabric. This article first describes full-mesh backplane technology,
RDMA over internet protocol, and Multipath RDMA, then introduces two novel
Multipath RDMA instrumentation systems: an instrumentation module and an instrumentation
blade. The instrumentation module supports Giga-sample per second (GSPS)
instrumentation and operates at the end of the wire using Power-over-Ethernet Plus. The
instrumentation blade goes directly into a Multipath RDMA-enabled blade chassis, supports
multi-GSPS instrumentation, and enables real-time or near-real-time parallel processing
of data produced by multi-GSPS instrumentation.
Key words: Multipath RDMA; power over Ethernet; remote direct memory access (RDMA).


Novel networking, instrumentation, and computation hardware is
introduced for multiple Giga-sample per second (GSPS) data acquisition,
hardware-in-the-loop simulation, signal processing, and spectrum
analysis. The system builds on 10 Gbits/s remote direct memory access
(RDMA) over internet protocol (IP) and uses full-mesh backplanes to
create a new networking technology called Multipath RDMA. A Multipath
RDMA full-mesh bridge chip provides over an order-of-magnitude increase
in bandwidth between computing nodes and supports real-time or
near-real-time parallel processing in multi-GSPS instrumentation systems.

Full-mesh  networking 

A full-mesh bridge chip enables Multipath RDMA through a full-mesh
passive backplane. The backplane, or midplane, implements a full-mesh
network topology that gives every processor blade a direct connection
to every other processor blade. Figure 1a illustrates this full-mesh
topology, while Figure 1b shows how the same blades are also mapped to
a dual-redundant star topology, the network topology used in a typical
Ethernet switch fabric. Each switch provides a cut-through switched
connection between any two blades in the system. The full-mesh fabric,
on the other hand, allows a blade to have as many connections as there
are other blades in the backplane. Hence, the full-mesh fabric provides
much higher aggregate bandwidth into and out of any one node.

Both the AdvancedTCA (PICMG® 2003) and VITA 46/48 standards (VITA 2008)
currently define full-mesh backplanes. These backplanes support up to
16 slots with 15 serial communications channels per slot. Each channel
supports full-duplex 10G Ethernet with a 10 Gigabit physical layer
composed of eight differential pairs, each operating at a 3.125 GHz
signaling rate. Hence each slot has a maximum of 150 Gbits/s of
bandwidth in each direction through the backplane. Unfortunately,
little exists on the market today that can use this bandwidth outside
of the backplane. Current switch chips support the full bandwidth
through the backplane but bottleneck down to one or two 10G Ethernet
connections to the processors on the blades. A full-mesh bridge chip
with a native host processor interface and Multipath RDMA capability
is designed to alleviate this bottleneck and provide over an
order-of-magnitude greater bandwidth between processors.
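
As a rough check on the figures above, the per-slot bandwidth follows directly from the slot count and the channel rate. The Python sketch below is illustrative only: the function names are hypothetical, and the nominal 10 Gbits/s channel rate ignores protocol overhead.

```python
def slot_bandwidth_gbps(slots: int = 16, channel_gbps: float = 10.0) -> float:
    """One-direction backplane bandwidth per slot in a full mesh.

    Each slot has one channel to every other slot, so a 16-slot
    backplane gives 15 channels per slot.
    """
    channels_per_slot = slots - 1
    return channels_per_slot * channel_gbps


def mesh_link_count(slots: int = 16) -> int:
    """Number of point-to-point links needed to fully mesh all slots."""
    return slots * (slots - 1) // 2


print(slot_bandwidth_gbps())  # 150.0 Gbits/s per slot, per direction
print(mesh_link_count())      # 120 links in a 16-slot full mesh
```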

Figure 1. (a) Full-mesh network topology, (b) dual-redundant star network topology

The full-mesh bridge chip, shown in Figure 2, provides a bridge between
a processor’s high-speed host interface and a full-mesh backplane
interface. The host interface can take the form of any high-speed,
chip-to-chip interconnect technology, such as Advanced Micro Devices’
(AMD’s) HyperTransport™ (HT) (HyperTransport Consortium 2001), Intel’s
QuickPath Interconnect (QPI) (Intel 2008), Rambus’ FlexIO™ (Rambus
2008), which is currently used in the IBM, Sony, and Toshiba Cell
Broadband Engine (Cell), or another future interconnect standard. Dual
host interfaces support dual-processor blades, with the full-mesh
bridge providing a peak bandwidth of around 20 GBytes/s (160 Gbits/s)
to each processor. The backplane interface supports the maximum fifteen
channels, which means that two processors connected via host interfaces
will be able to share the full 150 Gbits/s through the backplane.

Remote  direct  memory  access  over  IP 

In addition to the fifteen serial channels to the backplane, the
full-mesh bridge chip provides two interfaces for use as external 10G
Ethernet ports into a dual-redundant switch fabric, as shown in
Figure 3. In other words, the full-mesh bridge chip has the equivalent
of 17 10G serial connections: two external Ethernet ports and 15
backplane channels. Each processor can use these 17 10G connections to
transfer data directly from its memory to another processor’s memory
on any other blade in the system via RDMA.

RDMA over IP (RDMA Consortium 2003) differs from traditional Ethernet
transmission in that RDMA eliminates unnecessary buffering in the
operating system when transmitting and receiving packets. Instead of
copying packets into a buffer in the operating system before sending,
an RDMA Network Interface Controller (RNIC) takes data directly from
application user-space memory, applies the appropriate network-layer
protocols and Ethernet link-layer framing, and sends the packet across
the network. On the receiving end, another RNIC receives the packet and
places the payload directly into application user-space memory. By
removing the unnecessary data copying in the kernel and off-loading
network protocol processing at both ends of the link, RNICs alleviate
the latency issues normally associated with Ethernet and make Ethernet
a viable solution for high-speed, low-latency instrumentation systems.

Figure 2. Full-mesh bridge chip

In more detail, the RDMA Consortium (RDMA Consortium 2003) defined a
suite of protocols at the transport layer that enables cooperating
Direct Memory Access (DMA) engines at each end of a network connection
to move data between user-space memory buffers with minimal support
from the kernel and with “zero copy” to intermediate buffers. The
Remote Direct Memory Access Protocol Verbs specification (Recio 2007)
describes the behavior of the protocol off-load hardware and software,
defines the semantics of the RDMA services and how they appear to the
host software, and defines both the user and kernel application
programming interface. The Verbs specification defines RDMA READ,
WRITE, and SEND operations that transport data between user-space
memories or into a receive queue, respectively, and defines send and
receive queues and queue pairs to control data transport and
completion queues to signal when an operation is complete. Work
requests are converted into work queue elements, which are processed
in turn by the off-load engine. An asynchronous event (interrupt)
signals when work is complete. Also, data need not be in contiguous
locations in either the source or destination memories because
scatter-gather lists define the physical memory locations of data
segments.
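
The queue-based flow described above can be illustrated with a toy model. The Python sketch below is not the Verbs API; the class names, fields, and the in-process "off-load engine" are hypothetical stand-ins that only show how a work request becomes a work queue element, how scatter-gather lists describe non-contiguous data, and how a completion is reported.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkRequest:
    """Hypothetical RDMA WRITE request described by scatter-gather lists."""
    wr_id: int
    src_sgl: list  # [(offset, length), ...] in the source memory
    dst_sgl: list  # [(offset, length), ...] in the destination memory


@dataclass
class QueuePair:
    """Toy queue pair: a send queue plus a completion queue."""
    send_queue: deque = field(default_factory=deque)
    completion_queue: deque = field(default_factory=deque)

    def post_send(self, wr: WorkRequest) -> None:
        # Source and destination lists must describe the same byte count.
        if sum(n for _, n in wr.src_sgl) != sum(n for _, n in wr.dst_sgl):
            raise ValueError("scatter-gather lengths do not match")
        # The work request is queued as a work queue element (WQE).
        self.send_queue.append(wr)

    def process(self, src: bytearray, dst: bytearray) -> None:
        # Stand-in for the off-load engine: drain WQEs in order.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            # Gather the source segments into one byte string.
            data = b"".join(bytes(src[o:o + n]) for o, n in wqe.src_sgl)
            # Scatter the bytes into the destination segments.
            pos = 0
            for o, n in wqe.dst_sgl:
                dst[o:o + n] = data[pos:pos + n]
                pos += n
            # Signal completion by posting the request id.
            self.completion_queue.append(wqe.wr_id)


src = bytearray(b"AAAABBBBCCCC")
dst = bytearray(12)
qp = QueuePair()
qp.post_send(WorkRequest(wr_id=1, src_sgl=[(0, 4), (8, 4)], dst_sgl=[(2, 8)]))
qp.process(src, dst)
print(qp.completion_queue.popleft(), bytes(dst))
```

In this model the data moves straight from the source buffer to the destination buffer with no intermediate copy held by an operating system, which is the property the text refers to as "zero copy".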

The full-mesh bridge includes built-in support for RDMA over IP, and
the processors control the device through their host interface.
However, since the RNIC accesses processor main memory directly, the
RNIC must have a high-bandwidth path to main memory to avoid becoming
a bottleneck. XDR and DDR3 are the only memory interfaces that
currently match the 20 GBytes/s bandwidth of advanced host interfaces.
The Cell Broadband Engine, which uses XDR memory, is the only processor
that currently provides both a host and memory interface at this
bandwidth, but Intel and AMD should reach this level of performance in
the near term.

Multipath  RDMA 

While full-mesh connectivity provides high aggregate bandwidth between
multiple processors distributed across the backplane, the bandwidth
between any two processors is the same as if the backplane were a
simple switch because only one direct connection exists between any
two slots in the backplane. The same is true for connections between
processors connected through an external dual-redundant Ethernet
switch fabric. Processors connected across the dual-redundant switch
fabric have at most 20 Gbits/s of bandwidth for RDMA data transfer
(using both paths through the redundant fabrics). However, a new
technology called Multipath RDMA, incorporated into the bridge chip,
allows two individual processors located anywhere in the network to
use the full 150 Gbits/s bandwidth through the backplane plus the
20 Gbits/s bandwidth through the external switch fabric for RDMA data
transfer via multiple parallel paths.


Figure 3. Full-mesh bridge chip with Ethernet ports (10G Ethernet ports: 2x 4-lane serial PHYs; full-mesh backplane interface: 15x 4-lane serial PHYs)

Multipath RDMA gets its name from its similarity to multipath in
wireless communication. In wireless communications, multipath refers
to the multiple paths traveled by electromagnetic waves between
antennas due to reflections in the environment. New Multi-Input,
Multi-Output (MIMO) technologies like those incorporated into IEEE
802.11n (IEEE 2008) exploit multipath transmission to increase the
overall data rate. Similarly, Multipath RDMA uses every possible route
through both the full-mesh backplane and the dual-redundant switch
fabric to route packets between any two processors. The most
significant improvement Multipath RDMA offers over current networking
technology is that it provides over an order-of-magnitude increase in
bandwidth between processors in the same backplane and between blades
connected through a dual-redundant switch fabric.

First, when processor blades share a common full-mesh backplane, they
have 17 available channels to transmit data to any other processor.
Any two processors in the network have one direct channel through the
full-mesh backplane plus two switched connections. However, with
Multipath RDMA the two processors also gain 14 one-hop connections
through the remaining full-mesh channels via the other 14 blades’
full-mesh bridge chips. The full-mesh bridge chips act as cut-through
switches between the sending and receiving full-mesh bridges on the
source and destination blades. Figure 4 shows an example of Multipath
RDMA between nodes in the same backplane.
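
The path count for this first case can be checked with a short enumeration. The sketch below is illustrative and is not the bridge chip's routing logic; the slot numbering, the `switch0`/`switch1` labels, and the function name are assumptions made for the example.

```python
def intra_backplane_paths(src: int, dst: int, slots: int = 16,
                          switch_fabrics: int = 2) -> list:
    """List the parallel paths between two blades in one full-mesh backplane.

    Returns (kind, route) tuples, where route is the sequence of hops.
    """
    if src == dst or not (0 <= src < slots and 0 <= dst < slots):
        raise ValueError("src and dst must be distinct valid slot numbers")
    # One direct full-mesh channel between the two slots.
    paths = [("direct", [src, dst])]
    # One path through each external Ethernet switch fabric.
    paths += [("switched", [src, f"switch{k}", dst])
              for k in range(switch_fabrics)]
    # One path through every other blade acting as a cut-through switch.
    paths += [("one-hop", [src, mid, dst])
              for mid in range(slots) if mid not in (src, dst)]
    return paths


paths = intra_backplane_paths(0, 5)
print(len(paths))                 # 17 parallel paths
print(len(paths) * 10, "Gbits/s") # 170 Gbits/s at 10 Gbits/s per path
```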

Figure 4. Multipath RDMA through a full-mesh backplane

Second, when two blade chassis are connected through a dual-redundant
switch fabric, Multipath RDMA exploits the two full-mesh backplanes to
increase the bandwidth through the switch fabric. On the sending side,
the full-mesh bridge chip segments the payload and transmits the
resulting packets through all 17 available channels. The two switch
fabric ports send their packets to the Ethernet ports on the receive
side in the standard way. The 15 connections through the backplane
interface, however, route through the other 15 full-mesh bridge chips
in the backplane, which act as cut-through switches, to the switch
fabric ports on each blade. Then the packets are sent across the
switch fabric simultaneously to the 15 blades connected to the
receiving blade through the other backplane. The full-mesh bridge
chips on those blades then act as cut-through switches to the channels
connected to the receiving blade’s full-mesh bridge chip. The
full-mesh bridge on the receiving blade then combines the payloads
from each packet and places the data directly into the receiving
processor’s memory. Hence for a Multipath RDMA transmission through
the dual-redundant switch fabric, a processor has two one-hop switched
connections and 15 three-hop connections through full-mesh bridges.
Figure 5 shows an example of Multipath RDMA between nodes in separate
full-mesh backplanes.
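
The segment-and-recombine behavior described above can be sketched in a few lines. This is a toy model under stated assumptions: the round-robin channel assignment, the sequence numbers, and the 4-byte chunk size are illustrative choices, since the article does not specify how the bridge chip orders or labels packets.

```python
def segment(payload: bytes, channels: int = 17, chunk: int = 4) -> list:
    """Split a payload into numbered chunks spread round-robin over channels."""
    lanes = [[] for _ in range(channels)]
    for seq, start in enumerate(range(0, len(payload), chunk)):
        lanes[seq % channels].append((seq, payload[start:start + chunk]))
    return lanes


def reassemble(lanes: list) -> bytes:
    """Recombine chunks from all channels in sequence order."""
    packets = [pkt for lane in lanes for pkt in lane]
    packets.sort(key=lambda pkt: pkt[0])
    return b"".join(data for _, data in packets)


payload = bytes(range(100))
lanes = segment(payload)
# Channels may deliver in any order; sequence numbers restore the payload.
print(reassemble(list(reversed(lanes))) == payload)  # True
```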

In both cases, the bandwidth between any two full-mesh bridge chips is
a multiple of 10 Gbits/s up to the maximum 170 Gbits/s. The number of
channels utilized by Multipath RDMA can change on the fly such that
the number of communication paths between any two processors is
dynamic. As a result, Multipath RDMA can support a large number of
connections at dynamically changing bandwidths in order to optimize
connections for a wide range of algorithms. When compared to current
commercial off-the-shelf systems, Multipath RDMA provides more than
16 times the maximum bandwidth between processors connected through a
full-mesh backplane (170 Gbits/s versus 10 Gbits/s) and more than
eight times the maximum bandwidth between processors connected
through, and utilizing, a dual-redundant switch fabric (170 Gbits/s
versus 20 Gbits/s). However, real-time or near-real-time systems can
only take advantage of this bandwidth if the instruments are able to
obtain the same amount of bandwidth into and out of the network fabric.

Multipath  RDMA  instrumentation 

In order to achieve high bandwidth between instrumentation and a
Multipath RDMA-enabled supercomputing cluster, a new instrumentation
system had to be designed from the ground up around the concept of
using multiple connections to achieve higher bandwidth. First, the
instrumentation systems must accommodate analog-to-digital converters
(ADCs), digital-to-analog converters (DACs), or other sensors or
actuators that source or sink high-bandwidth data streams. With up to
170 Gbits/s bandwidth through a blade chassis, Multipath RDMA supports
ADCs and DACs with multi-GSPS data rates. Second, the instrumentation
systems must either have a single high-bandwidth connection into the
computing cluster or several lower-bandwidth parallel connections.

Figure 5. Multipath remote direct memory access (RDMA) across a dual-redundant switch fabric

To this end, designs for two different instrumentation systems were
developed. The first design is a 10/20 Gbits/s RDMA Power-over-Ethernet
Plus (PoEP) Instrumentation Module that supports ADCs and DACs up to
2 GSPS (at 8 bits of resolution). The second design approach puts the
ADCs and DACs directly onto the blades in the supercomputing cluster
and supports sampling rates up to 20 GSPS (at 8 bits of resolution).
The first approach has the advantage of placing the ADCs and DACs
close to the sensors and actuators, which reduces noise on the analog
side and decreases the number of cables. The second approach has the
advantage of supporting the fastest ADCs and DACs currently on the
market and provides enough bandwidth into a full-mesh backplane to
distribute full-rate sensor data across parallel signal processors
using Multipath RDMA.

10/20 Gbits/s RDMA PoEP instrumentation modules

The 10/20 Gbits/s RDMA PoEP Instrumentation Module is a remote
high-speed sensor data acquisition and actuator controller system that
both transmits data through and gets its power from a single CAT6 or
CAT7 Ethernet cable. The module consists of a motherboard and an XMC
card. The motherboard handles the Ethernet communication, IEEE 1588
time synchronization, and main power supplies. The XMC card contains
high-speed ADCs or DACs and analog instrumentation. With pluggable XMC
cards, the module is extremely versatile and can be used in almost any
data acquisition, signal processing, hardware-in-the-loop, or other
type of instrumentation system.

XMC is the VITA 42 standard (VITA 2008) that extends the PCI Mezzanine
Card (PMC) to several different high-speed buses including PCI Express
(VITA 42.3) and HyperTransport (VITA 42.4). The XMC standard supports
up to 16 data lanes per slot. With 8-lane PCI Express V2.0 the XMC slot
provides up to 32 Gbits/s of bandwidth, which means that a plug-in card
can support a single ADC sampling at 4 GSPS with 8 bits of resolution,
a 2 GSPS DAC with 16 bits of resolution, or multiple ADCs or DACs with
lower sampling rates or resolutions. However, the module itself only
supports up to 20 Gbits/s into a switch fabric, which means the XMC
card is limited to a single ADC sampling at 2 GSPS with 8 bits of
resolution or a 1 GSPS DAC with 16 bits of resolution.
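
The converter limits quoted above follow from multiplying sample rate by resolution and comparing the result with each link. The Python sketch below restates that arithmetic; the function and constant names are hypothetical, and it compares raw rates only, ignoring protocol and framing overhead.

```python
XMC_PCIE_GBPS = 32.0       # 8-lane PCI Express 2.0 figure quoted in the text
MODULE_UPLINK_GBPS = 20.0  # two 10G Ethernet ports into the switch fabric


def converter_rate_gbps(sample_rate_gsps: float, bits: int,
                        converters: int = 1) -> float:
    """Raw data rate in Gbits/s for one or more ADCs or DACs."""
    return sample_rate_gsps * bits * converters


def fits(rate_gbps: float, link_gbps: float) -> bool:
    """True when the raw converter rate does not exceed the link rate."""
    return rate_gbps <= link_gbps


for label, gsps, bits in [("4 GSPS, 8-bit ADC", 4, 8),
                          ("2 GSPS, 16-bit DAC", 2, 16),
                          ("2 GSPS, 8-bit ADC", 2, 8),
                          ("1 GSPS, 16-bit DAC", 1, 16)]:
    rate = converter_rate_gbps(gsps, bits)
    print(f"{label}: {rate:g} Gbits/s, "
          f"XMC slot ok={fits(rate, XMC_PCIE_GBPS)}, "
          f"module uplink ok={fits(rate, MODULE_UPLINK_GBPS)}")
```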

XMC cards with various ADC and DAC configurations are currently
available as commercial off-the-shelf boards. The real innovation in
the instrumentation module is the motherboard. The motherboard
supports up to two 10G Ethernet ports, IEEE 1588, and PoEP. It has an
extremely small form factor, about the same size as a double-wide XMC
card (15 cm × 15 cm), and as a result can be placed relatively close
to the sensors or actuators, which reduces the lengths of the analog
sensor and actuator wires. Figure 6 shows a simplified block diagram
of the motherboard.

Figure 6. 10G PoEP motherboard

A dual-port RDMA ASIC on the motherboard, such as NetEffect’s 10GbE
ASIC (NetEffect Inc., Austin, Texas), provides two 10 Gbits/s RDMA
over IP Ethernet ports. With RDMA support, the instrumentation module
can transfer data directly between its own memory and tens, hundreds,
or even thousands of processor memories in a supercomputing cluster
through an Ethernet switch fabric. Although two 10G Ethernet
connections alone may not warrant using Multipath RDMA on the module,
Multipath RDMA can be used on the server end to send data to multiple
instrumentation modules from a single processor or read data from
multiple instrumentation modules into a single processor.

IEEE 1588 Precision Time Protocol (PTP) (IEEE 2002) is a time
synchronization protocol that allows the motherboard to synchronize a
local clock with a master clock located somewhere in the network. Once
the motherboard clock is synchronized with the IEEE 1588 PTP master
clock, it can then be used as a reference for the ADC and DAC clocks
on the XMC card. By using IEEE 1588 PTP time synchronization, every
instrumentation and processing node in the network can have clocks
synchronized to within 100 ns of each other or better.
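
For readers unfamiliar with PTP, the clock offset is commonly estimated from four timestamps exchanged between master and slave (the delay request-response mechanism), under the assumption that the network path is symmetric. The sketch below shows only that arithmetic; the function name and the example timestamps are hypothetical, and real implementations add filtering and hardware timestamping.

```python
def ptp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple:
    """Estimate slave clock offset and one-way path delay.

    t1: master clock time when the Sync message is sent
    t2: slave clock time when the Sync message is received
    t3: slave clock time when the Delay_Req message is sent
    t4: master clock time when the Delay_Req message is received
    Assumes the path delay is the same in both directions.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# Hypothetical timestamps in nanoseconds: the slave clock runs 250 ns
# ahead of the master and the one-way path delay is 400 ns.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=1650, t3=2000, t4=2150)
print(offset_ns, delay_ns)  # 250.0 400.0
```

Subtracting the estimated offset from the local clock brings it into agreement with the master, which is the step that lets the ADC and DAC clocks share a common reference.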

PoEP is the IEEE 802.3at (IEEE 2005) extension to the IEEE 802.3af
(IEEE 2003) Power-over-Ethernet standard that allows a powered device
to use up to 70 W of power supplied by power sourcing equipment. The
IEEE 802.3at draft standard defines a midspan power sourcing equipment
device that connects in series to insert power into an Ethernet link.
Since the motherboard accepts PoEP, the motherboard and XMC card can
use up to 70 W of power, which is more than enough power for most
applications. PoEP products are currently available for Gigabit
Ethernet, but extending PoEP technology to 10G Ethernet is
straightforward since 10GBase-T uses the same 4 pairs as 1000Base-T.

As mentioned above, the instrumentation module connects to the
Multipath RDMA-enabled cluster and receives its power via a PoEP
midspan device or switch. With a midspan configuration and CAT7 cable,
each instrumentation module can be located up to 100 m from a switch.
Hence, the instrumentation modules can be located far from the
processing cluster, which is often convenient when the cluster cannot
be physically located nearby. Figure 7 shows an illustration of the
module network.

The only disadvantage of the instrumentation module approach is the
limited bandwidth between the ADCs and DACs and the cluster. With only
20 Gbits/s maximum bandwidth into the switch fabric, the modules
cannot support the sampling rate and resolution of the fastest ADCs
and DACs currently available. For this reason, a second, faster
instrumentation system approach was developed that has sufficient
bandwidth to support state-of-the-art ADCs and DACs and can use
Multipath RDMA to distribute data throughout the cluster.

Multipath  RDMA  instrumentation  blades 

In order to achieve the maximum bandwidth into a Multipath RDMA
system, the ADCs and DACs must have a high-speed interface into a
full-mesh backplane. A straightforward way to gain access to the
full-mesh backplane is to place the ADCs and DACs on the blades
themselves. The ADCs and DACs can be interfaced to the full-mesh
bridge using an FPGA, which buffers data in memory and provides a host
interface to the full-mesh bridge.

Figure 7. Instrumentation module network with the IEEE 1588 master clock

FPGAs can communicate with the fastest ADCs and DACs on the market
through a high-speed interface. The latest FPGAs provide a DDR3
interface supporting up to several Gigabytes of memory, which, for
example, can be used to buffer data from an ADC before it is
distributed to processors in the cluster. Finally, the FPGA’s host
interface bridges the instrumentation into the full-mesh bridge. The
full-mesh bridge is then able to use Multipath RDMA to transfer data
between the FPGA’s memory and the memory of any processor in the
cluster. Furthermore, with IEEE 1588 PTP support built into the
full-mesh bridge, instrumentation blades and processing blades will
all support clock time synchronization in the same fashion as the
instrumentation modules. Figure 8 shows a block diagram of the
instrumentation blade architecture.

Application  example 

As a demonstration of how the Multipath RDMA instrumentation system
works, the following section shows how the system could compute a
billion-point complex fast Fourier transform (CFFT) of acquired sensor
data in near-real-time using the fastest commercial off-the-shelf ADCs
and multiple Cell Broadband Engines from IBM. First, assume two 8-bit
5-GSPS ADCs capture an arbitrary waveform starting at time T = 0
seconds on an instrumentation blade connected to a full-mesh
backplane. The two ADCs sample the waveform in-phase and
quadrature-phase (I & Q). A billion complex data points are sampled
every 214.7 ms (a billion in this case is actually 2^30). This data
rate corresponds to a bandwidth of 10 GBytes/s from the ADCs. Since
the ADC interface and the DDR3 interface on the FPGA have more than
10 GBytes/s bandwidth, the data points are stored in the FPGA memory
as fast as they are sampled.

As the data is stored in memory, the FPGA simultaneously tasks the
full-mesh bridge to transport the data to a Cell processor somewhere
in the cluster using Multipath RDMA. Since the host interface and the
network bandwidth to a Cell processor through the full-mesh bridge
using Multipath RDMA are greater than the sampling rate of the ADC,
the data is streamed into a Cell processor’s memory as fast as the
ADCs sample the waveform.
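
The acquisition figures in this example can be reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not a model of the hardware; the constant names are illustrative, and the link figures are the nominal peak rates quoted earlier in the article.

```python
POINTS = 2**30           # "a billion" complex points
SAMPLE_RATE_SPS = 5e9    # each ADC samples at 5 GSPS
BYTES_PER_SAMPLE = 1     # 8-bit resolution
ADCS = 2                 # in-phase and quadrature-phase channels

# Time to acquire one block of complex points.
capture_time_s = POINTS / SAMPLE_RATE_SPS
# Combined data rate of both ADCs in GBytes/s.
adc_rate_gbytes_s = ADCS * SAMPLE_RATE_SPS * BYTES_PER_SAMPLE / 1e9

HOST_INTERFACE_GBYTES_S = 20.0   # peak host interface rate from the text
MULTIPATH_GBYTES_S = 170.0 / 8   # 170 Gbits/s expressed in GBytes/s

# Streaming keeps up only if the slowest link exceeds the ADC rate.
can_stream = adc_rate_gbytes_s <= min(HOST_INTERFACE_GBYTES_S,
                                      MULTIPATH_GBYTES_S)

print(f"capture time: {capture_time_s * 1e3:.1f} ms")     # 214.7 ms
print(f"ADC data rate: {adc_rate_gbytes_s:.1f} GBytes/s")  # 10.0 GBytes/s
print(f"streams in real time: {can_stream}")               # True
```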

The IBM Cell Broadband Engine is a multicore heterogeneous processor
with one PowerPC Processor Element (PPE) and eight Synergistic
Processor Elements (SPEs). The most recent update to the Cell, the
PowerXCell 8i (IBM, Los Alamos, New Mexico), used in the Los Alamos
Roadrunner QS22 blade cluster (Koch 2007), has a high-speed FlexIO
interface that supports up to 20 GBytes/s I/O bandwidth and a DDR2
memory controller that supports up to 64 Gigabytes of memory. This
latest update also boasts 230 GFLOPS single-precision floating-point
performance and over 100 GFLOPS double-precision floating-point
performance.

Figure 8. Multipath remote direct memory access (RDMA) instrumentation blade

In order to perform a near-real-time billion-point CFFT on a Cell, the
processor must have both enough memory to support the large data set
and a significant amount of computational power to perform the
calculations in the required time. A billion-point single-precision
floating-point CFFT requires 8 Bytes of storage per complex data point
and thus needs 8 GB of memory for storage. While the first-generation
Cell’s XDR interface did not support this much memory, the PowerXCell
8i’s DDR2 controller interface can easily meet this requirement.

The ITEA Journal of Test and Evaluation, 29(3), September 2008
McMillian, Snyder, Ferguson, & McMillian

The Cell provides unmatched computational power in a single chip. With
230 GFLOPS single-precision performance, the Cell can perform a
16M-point CFFT in 0.043 seconds (Chow, Fossum, and Brokenshire 2005).
The time required to compute a 16M-point CFFT on a Cell processor can
be scaled to estimate the time required to calculate a billion-point
CFFT. The first part of the calculation involves determining the
relative complexity factor of a billion-point CFFT to a 16M-point
CFFT. The complexity of a CFFT is determined by N log2(N), where N is
the number of points. Hence the relative complexity of a billion-point
CFFT when compared to a 16M-point CFFT is
[1B log2(1B)]/[16M log2(16M)] = (2^30 × 30)/(2^24 × 24) = 80. Assuming
the billion-point CFFT is as parallelizable as the 16M-point CFFT, the
time a Cell takes to calculate a billion-point CFFT is 80 × 0.043
seconds, or 3.44 seconds.
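
The scaling estimate above can be reproduced directly. The sketch below assumes, as the text does, that run time scales with N log2(N) and that "16M" and "a billion" mean 2^24 and 2^30 points; the function and constant names are illustrative.

```python
import math


def fft_work(n: int) -> float:
    """Relative operation count of an n-point FFT, proportional to n*log2(n)."""
    return n * math.log2(n)


BASELINE_POINTS = 2**24    # 16M-point CFFT
BASELINE_SECONDS = 0.043   # measured time quoted in the text
TARGET_POINTS = 2**30      # billion-point CFFT

ratio = fft_work(TARGET_POINTS) / fft_work(BASELINE_POINTS)
estimate_s = ratio * BASELINE_SECONDS

print(f"relative complexity: {ratio:.0f}")    # 80
print(f"estimated time: {estimate_s:.2f} s")  # 3.44 s
```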

Since the ADC can sample one billion complex data points in 214.7 ms,
computing the billion-point CFFT in real time is not currently
possible on a single Cell. However, by distributing the computational
load across multiple Cell processors, the CFFT can be performed in
real-time or near-real-time. A Multipath RDMA Instrumentation system
with one instrumentation blade and nine Cell blades containing 18
Cells is able to compute a billion-point CFFT on a 5 GSPS complex
signal in real-time.
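
The blade count can be sanity-checked by dividing the single-Cell estimate by the acquisition time. The sketch below assumes ideal linear speedup with no communication or synchronization overhead, which is an optimistic simplification; the names are illustrative.

```python
import math

SINGLE_CELL_SECONDS = 3.44    # estimated billion-point CFFT time on one Cell
FRAME_SECONDS = 2**30 / 5e9   # time to acquire 2**30 complex points at 5 GSPS
CELLS_PER_BLADE = 2           # dual-processor blades, as in the text


def cells_needed(single_cell_s: float, frame_s: float) -> int:
    """Smallest number of Cells that finishes one CFFT per frame time."""
    return math.ceil(single_cell_s / frame_s)


min_cells = cells_needed(SINGLE_CELL_SECONDS, FRAME_SECONDS)
blades = math.ceil(min_cells / CELLS_PER_BLADE)
cells = blades * CELLS_PER_BLADE
per_frame_s = SINGLE_CELL_SECONDS / cells

print(min_cells, blades, cells)        # 17 9 18
print(f"{per_frame_s * 1e3:.1f} ms")   # 191.1 ms, under the 214.7 ms frame
```

Under this idealized model 17 Cells would be just enough, so nine dual-Cell blades (18 Cells) leave roughly 10 percent of each frame as margin.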

Conclusion 

In conclusion, Multipath RDMA is a new networking technology that
exploits multiple routes through a fabric to achieve an
order-of-magnitude increase in bandwidth between processors in
instrumentation systems and supercomputing clusters. It also provides
a novel means to interface ultra-high-speed instrumentation to a
computing cluster and makes real-time or near-real-time data
acquisition and parallel processing possible. Multipath RDMA is a
quantum leap in the battle for more bandwidth. While several
organizations are currently working on 100G Ethernet standards,
Multipath RDMA offers even greater bandwidth using current 10G
Ethernet technology. As Ethernet speeds increase, Multipath RDMA will
maintain this advantage. Multipath RDMA is an innovative way to get to
the next level of bandwidth capacity by exploiting an underutilized
resource in a blade system, the passive backplane.